Abstract

User authentication and intrusion detection differ from standard classification
problems in that while we have data generated from legitimate users, impostor
or intrusion data is scarce or non-existent. We review existing techniques for
dealing with this problem and propose a novel alternative based on a principled
statistical decision-making viewpoint. We examine the technique on a toy
problem and validate it on complex real-world data from an RFID based access
control system. The results indicate that it can significantly outperform the
classical world model approach. The method could be more generally useful in
other decision-making scenarios where there is a lack of adversary data.

1 Introduction 



Classification is the problem of categorising data in one of two or more possible 
classes. In the classical supervised learning framework, examples of each class 
have already been obtained and the task of the decision maker is to accurately 
categorise new observations, whose class is unknown. The accuracy is either 
measured in terms of the rate of misclassification, or in terms of the average 
cost, for problems where different types of errors carry different costs. In that 
setting, the problem has three phases: (a) the collection of training data, (b) the
estimation of a decision rule based on the training data and (c) the application
of the decision rule to new data. Typically, the decision rule remains fixed after
the second step. Thus, the problem becomes that of finding the decision rule 
with minimum risk from the training data. 

Unfortunately, some problems are structured in such a way that it is not 
possible to obtain data from all categories to form the decision rule. Novelty 
detection, user authentication, network intrusion detection and spam filtering 
all belong to this type of decision problems: while the data of the "normal" 
class is relatively easily characterised, the data of the other class which we wish 
to detect is not. This is partially due to the potentially adversarial nature of 
the process that generates the data of the alternative class. As an example, 
consider being asked to decide whether a particular voice sample belongs to a 
specific person, given a set of examples of his voice and your overall experience 
concerning the voices of other persons. 

In this paper, we shall employ two conceptual classes: the "user" and the
"adversary". The main distinction is that while we shall always have examples
of instances of the user class, we may not have any data from the adversary 
class. 

This problem is alleviated in authentication settings, where we must separate 
accesses by a specific user from accesses by an adversary. Such problems contain 
additional information: data which we have obtained from other people. This 
can be used to create a world model, which can then act as an adversary model, 
and has been used with state-of-the-art results in authentication [?, 20].



Since there is no explicit adversary model, the probability of an attack cannot
be estimated. Our main contribution is a decision making principle which
employs a pessimistic estimate on the probability of an attack. Intuitively, this
is done by conditioning the adversary model on the current observations, whose
class is unknown. This enables us to place an upper bound on the probability of
the adversary class, in a Bayesian framework. To the best of our knowledge, this
is the first time that such a Bayesian worst-case approach has been described in
the literature. The proposed method is compared with both an oracle and the
world model approach on a test-bench. This shows that our approach can
outperform the world model under a variety of conditions. This result is validated
on the real-world problem of detecting unauthorised accesses in a building.

The remainder of this section discusses related work. The model framework
is introduced in Sec. 2, with the proposed Bayesian estimates discussed
in Sec. 2.2 and methods for estimating the prior in Sec. 2.3. The conclusion is
preceded by Sec. 3, which presents experiments and results.

1.1 Related work 

Classification algorithms have been extensively used for the detection of
intrusions in wired [?, 13] and wireless [1, 14] networks. Their main disadvantage is
that labelled normal and attack data must be available for training. After the
training phase, the classifier's learnt model will be used to predict the labels of
new unknown data. However, such labelled data is very hard to obtain and often
unreliable. Finally, there will always exist new unknown attacks for which training
data are not available at all.

Outlier detection [?] and clustering [?] use unlabelled data and are in
principle able to detect unknown types of attacks. The main disadvantage is
that no explicit adversarial model is employed.

An alternative framework is the world model approach [?]. This
is extensively used in speech and image authentication problems, where data
from a considerable number of users are collected to create a world model (also
called a universal background model). This approach is closely related to the
model examined in this paper, since it originates in the seminal work of [13], who
employed an empirical Bayes technique for estimating a prior over models. Thus,
the world model is a distribution over models, although due to computational
considerations a point estimate is used instead in practice [20].

The adversary may actively try to avoid detection, through knowledge of the
detection method. In essence, this changes the setting from a statistical to an
adversarial one. For such problems, game theoretic approaches are frequently
used. Dalvi et al. [?] investigated the adversarial classification problem as
a two-person game. More precisely, they examined the optimal strategy of
an adversary against a standard (adversary-unaware) classifier as well as that
of a classifier (adversary-aware) against a rational adversary. This was under
the assumption that the adversary has complete knowledge of the detection
algorithm. In a similar vein, Lowd et al. [15] have investigated algorithms
for reverse engineering linear classifiers. This allows them to retrieve sufficient
information to mount effective attacks.

In our paper we do not consider repeated interactions and thus we do not
follow a game-theoretic approach. We instead consider how to model the
adversary, when we have a lot of data from legitimate users, but no data from
the adversary. Our main contribution is a Bayesian method for calculating a
subjective upper bound on attack probabilities without any knowledge of the
adversary model. This can be obtained simply by using the current (unlabelled)
observations to create a worst-case (or more generally pessimistic) model of the
adversary.¹ This is done by conditioning the prior over adversary models
according to new (unlabelled) observations.

However, in order to control overfitting, we first condition the adversary
model's prior on the data of the remaining population of users. This results in
an empirical Bayes estimate of the prior [21], which is what the world model
approach essentially is. The prior then acts as a soft constraint when
selecting the worst-case adversary model.

It is worthwhile to note that the problem of constructing a model for a class
with no data is related to the problem of null hypothesis testing, for which
similar ideas have appeared. For example, [10] explored the idea of constructing a
maximum likelihood estimate from the observations and using this as the
alternative hypothesis. More sophisticated examples for simple parametric problems
were examined in [?]. This involved selecting the worst-case prior from a given
class of priors in order to be maximally pessimistic about the null hypothesis.

¹ Some simpler alternative approaches are explored in an accompanying technical report.



Our approach is similar in spirit, but the application and technical details are
substantially different. 

Our final contribution is an experimental analysis on a synthetic problem,
as well as on some real-world data, with promising results: we show that the
widely used world model approach cannot outperform the proposed model.

2 The proposed model framework 

In the framework we consider, we assume that the set of all possible models is
$\mathcal{M}$. Each model $\mu$ in $\mathcal{M}$ is associated with a probability measure over the set
of observations $\mathcal{X}$, which will be denoted by $\mu(x)$ for $x \in \mathcal{X}$, $\mu \in \mathcal{M}$, so long
as there is no ambiguity. We must decide whether some observations $x \in \mathcal{X}$
have been generated by a model $q$ (the user) or a model $w$ (the adversary) in
$\mathcal{M}$. Throughout the paper, we assume a prior probability of the user having
generated the data, $P(q)$, with a complementary prior $P(w) = 1 - P(q)$ for the
adversary.

In the easiest scenario, we have perfect knowledge of $q, w \in \mathcal{M}$. It is then
trivial to calculate the probability $P(q \mid x)$ that the user $q$ has generated the data
$x$. This is the oracle decision rule, defined in Section 2.1. This is not a realisable
rule, as although we could accurately estimate $q$ with enough data, in general
there is no way to estimate the adversary model $w$.

We thus consider the case where the user model is known and where we
are given a prior density $\xi(w)$ over the possible adversary models $w \in \mathcal{M}$.
Currently seen observations are then used to form a pessimistic posterior $\xi'$ for
the adversary. This is explained in Section 2.2.

Section 2.3 discusses the more practical case where neither the user model
$q$, nor a prior $\xi$ over models $\mathcal{M}$ are known, but must be estimated from data.
More precisely, the section discusses methods for utilising other user data to
obtain a prior distribution over models. This amounts to an empirical Bayes
estimate of the prior distribution [21]. It is then possible to estimate $q$ by
conditioning the prior on the user data. This is closely related to the adapted
world model approach [20] used in authentication applications, which, however,
usually employs a point approximation to the prior [5].

2.1 The oracle decision rule 

We shall measure the performance of all the models against that of the oracle 
decision rule. The oracle enjoys perfect information about the distribution of 
both the user and the adversary, and thus knows both q and w, as well as the 
a priori probability of an attack, P(w). On average, no other decision rule can 
do better. 

More precisely, let $\mathcal{M}$ be the space of all models. Let the adversary's model
be $w$ and the user's model be $q$, with $q, w \in \mathcal{M}$. Given some data $x$, we would
like to determine the probability that the data $x$ has been generated by the
user, $P(q \mid x)$. The oracle model has knowledge of $w$, $q$ and $P(q)$, so using Bayes'
rule we obtain:

$$P(q \mid x) = \frac{q(x)\,P(q)}{q(x)\,P(q) + w(x)\,(1 - P(q))}. \tag{1}$$

However, we usually have uncertainty about both the adversary and the user 
model. Concerning the adversary, the uncertainty is much more pronounced. 
The next section examines a model for the probability of an attack when the
user model is perfectly known but we only have a prior $\xi(w)$ for the adversary
model.
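As a minimal sketch of the oracle rule, the following assumes discrete observations and models represented as probability vectors; the specific vectors and the helper names are hypothetical and serve only to illustrate Bayes' rule with a known user model $q$ and a known adversary model $w$.

```python
import math


def log_likelihood(model, observations):
    """Log-probability of i.i.d. discrete observations under a probability vector."""
    return sum(math.log(model[x]) for x in observations)


def oracle_posterior(q, w, prior_user, observations):
    """P(q | x) by Bayes' rule when both q and w are known exactly."""
    log_user = math.log(prior_user) + log_likelihood(q, observations)
    log_adv = math.log(1.0 - prior_user) + log_likelihood(w, observations)
    # Subtract the larger term before exponentiating to avoid underflow.
    shift = max(log_user, log_adv)
    user = math.exp(log_user - shift)
    adv = math.exp(log_adv - shift)
    return user / (user + adv)


# Hypothetical three-outcome models, chosen only for illustration.
q = [0.7, 0.2, 0.1]
w = [0.2, 0.3, 0.5]
x = [0, 0, 1, 0]
p_user = oracle_posterior(q, w, prior_user=0.5, observations=x)
print(round(p_user, 4))
```

Working in log space is a design choice for long observation sequences, where the raw likelihoods would otherwise underflow.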

2.2 Bayesian adversary model 

We can use a subjective prior probability $\xi(w)$ over possible adversary models
to calculate the probability of observations given that they have been generated
by the adversary: $\xi(x) = \int_{\mathcal{M}} w(x)\,\xi(w)\,dw$.² Given a user model $q$, we can
express the probability of the user $q$ given the observations $x$ under the belief $\xi$
as:

$$\xi(q \mid x) = \frac{q(x)\,P(q)}{q(x)\,P(q) + \xi(x)\,(1 - P(q))}. \tag{2}$$

The difference with (1) is that, instead of $w(x)$, we use the marginal density $\xi(x)$.
If $\xi(w)$ represents our subjective belief about the adversary model $w$, then $\xi$
can be seen as the Bayesian equivalent of the world model approach, where the
prior over $w$ plays the role of the world model. Now let $\xi'(w) = \xi(w \mid x)$ be the
model posterior for some observations $x$. We shall need the following lemma:

Lemma 2.1. For any probability measure $\xi$ on $\mathcal{M}$, where $\mathcal{M}$ is a space of
probability distributions on $\mathcal{X}$, such that each $\mu \in \mathcal{M}$ defines a probability
(density) $\mu(x)$ with $x \in \mathcal{X}$, with admissible posteriors $\xi'(\mu) = \xi(\mu \mid x)$, the marginal
likelihood satisfies: $\xi'(x) \geq \xi(x)$, $\forall x \in \mathcal{X}$.

A simple proof, using the Cauchy-Schwarz inequality on the norm induced
by the measure $\xi$, is presented in the Appendix. From the above lemma, it
immediately follows that:

$$\xi(q \mid x) \geq \xi'(q \mid x) = \frac{q(x)\,P(q)}{q(x)\,P(q) + (1 - P(q)) \int_{\mathcal{M}} w(x)\,\xi'(w)\,dw}, \tag{3}$$

since $\xi'(x) = \int_{\mathcal{M}} w(x)\,\xi'(w)\,dw \geq \int_{\mathcal{M}} w(x)\,\xi(w)\,dw = \xi(x)$. Thus (3) gives us,
through $1 - \xi'(q \mid x)$, a subjective upper bound on the probability of the data $x$
having been generated by the adversary. This bound can then be used to make
decisions. Finally, note that we can form $\xi'(w)$ on a subset of $x$. This possibility
is explored in the experiments.
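The inequality can be checked numerically on a small example. The sketch below assumes a finite set of candidate adversary models, so that the integral over models becomes a weighted sum; the models, prior weights and observations are hypothetical values chosen for illustration.

```python
def likelihood(model, observations):
    """Probability of i.i.d. discrete observations under a probability vector."""
    p = 1.0
    for x in observations:
        p *= model[x]
    return p


def marginal(models, weights, observations):
    """xi(x) = sum over w of w(x) * xi(w), for a finite set of adversary models."""
    return sum(wt * likelihood(m, observations) for m, wt in zip(models, weights))


def posterior_weights(models, weights, observations):
    """xi'(w) = xi(w | x): prior weights rescaled by each model's likelihood."""
    joint = [wt * likelihood(m, observations) for m, wt in zip(models, weights)]
    total = sum(joint)
    return [j / total for j in joint]


def user_posterior(q, adversary_marginal, prior_user, observations):
    """Probability of the user given x, for a given adversary marginal."""
    numerator = likelihood(q, observations) * prior_user
    return numerator / (numerator + (1.0 - prior_user) * adversary_marginal)


# Hypothetical candidate adversary models and prior belief xi over them.
models = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6], [1 / 3, 1 / 3, 1 / 3]]
xi = [0.5, 0.25, 0.25]
q = [0.7, 0.2, 0.1]
x = [2, 2, 1]

xi_x = marginal(models, xi, x)               # world-model marginal xi(x)
xi_prime = posterior_weights(models, xi, x)  # pessimistic belief xi'
xi_prime_x = marginal(models, xi_prime, x)   # xi'(x), never smaller than xi(x)

world = user_posterior(q, xi_x, 0.5, x)
pessimistic = user_posterior(q, xi_prime_x, 0.5, x)
print(xi_x, xi_prime_x, world, pessimistic)
```

Because the conditioned belief shifts weight towards the models that explain $x$ best, the adversary marginal can only grow, and the resulting user probability can only shrink.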



² Here we used the fact that $\xi(x \mid w) = w(x)$, since the probability of the observations given
a specific model $w$ no longer depends on our belief $\xi$ about which model $w$ is correct.



2.3 Prior and user model estimation 



Specifically for user authentication, we have data from two sources. The first
is data collected from the user which we wish to identify. The second is data
collected from other persons.³ The $i$-th person can be fully specified in terms of
a model $\mu_i \in \mathcal{M}$, with $\mu_i$ drawn from some unknown distribution $\gamma$ over $\mathcal{M}$. If
we had the models $\mu_i \in \mathcal{M}$ for all the other people in our dataset, then we could
obtain an empirical estimate $\hat{\gamma}$ of the prior distribution of models. Empirical
Bayes methods for prior estimation [21] extend this procedure to the case where
we only observe $x \sim \mu_i$, data drawn from the model $\mu_i$.

Let us now apply this prior over models to the estimation of the posterior
over models for some user. Given an estimate $\hat{\gamma}$ of $\gamma$, and some data $x \sim \mu$
from the user, and assuming that $\mu \sim \gamma$, we can form a posterior for $\mu$ using
Bayes' rule: $\hat{\gamma}(\mu \mid x) = \mu(x)\,\hat{\gamma}(\mu) / \int_{\mathcal{M}} \mu'(x)\,\hat{\gamma}(d\mu')$, over all $\mu \in \mathcal{M}$. For a specific
user $k$ with data $x_k$, we write the posterior as $\psi_k(\mu) = \hat{\gamma}(\mu \mid x_k)$. Whenever we
must decide the class of a new observation $x$, we set the prior over the adversary
models to $\xi = \hat{\gamma}$ and then condition on part, or all, of $x$ to obtain the posterior
$\xi'(w)$. We then calculate

$$P(q_k \mid x, \xi', \psi_k) = \frac{\psi_k(x)\,P(q_k)}{\psi_k(x)\,P(q_k) + (1 - P(q_k))\,\xi'(x)}, \tag{4}$$

the posterior probability of the $k$-th user given the observations $x$ and our beliefs
$\xi'$ and $\psi_k$ over adversary and user models respectively. When $\xi' = \xi$, we obtain
an equivalent to the world model approach of [20], which is an approximate
form of the empirical Bayes procedure suggested in [13].



2.3.1 Prior estimation for multinomial models 

For discrete observations, we can consider multinomial distributions drawn from
a Dirichlet density, and use a maximum likelihood estimate based on Polya
distributions for $\hat{\gamma}$. More specifically, we use the fixed point approach suggested
in [16] to estimate Dirichlet parameters $\Phi$ from a set of multinomial observations.

To make this more concrete, consider multinomial observations of degree $K$.
Our initial belief $\xi(\mu)$ is a Dirichlet prior with parameters $\Phi = (\phi_1, \ldots, \phi_K)$ over
models: $\xi(\mu) = \frac{1}{B(\Phi)} \prod_{i=1}^{K} \mu_i^{\phi_i - 1}$, which is conjugate to the multinomial [?].

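Given a sequence of observations $x_1, \ldots, x_n$, with $x_t \in \{1, \ldots, K\}$, where each
outcome $i$ has fixed probability $\mu_i$, then $c_i = \sum_{t=1}^{n} \mathbb{I}(x_t = i)$, where $\mathbb{I}$ is an
indicator function, is multinomial and the posterior distribution over the
parameters $\mu_i$ is also Dirichlet with parameters $\phi'_i = \phi_i + c_i$. The approach
suggested in [3] uses the following fixed point iteration for the parameters:

$$\phi_i^{\text{new}} = \phi_i \, \frac{\sum_{j} \left[\Psi(c_{j,i} + \phi_i) - \Psi(\phi_i)\right]}{\sum_{j} \left[\Psi\left(n_j + \sum_{k} \phi_k\right) - \Psi\left(\sum_{k} \phi_k\right)\right]},$$

where $j$ indexes the individuals in the population data, $c_{j,i}$ is the count of
outcome $i$ for individual $j$, $n_j = \sum_i c_{j,i}$, and $\Psi(\cdot)$ is the digamma function.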
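A minimal sketch of such a fixed point estimate is given below. It uses the commonly cited update for the Polya (Dirichlet-multinomial) likelihood; that this matches the exact variant used in the paper is an assumption. The `digamma` helper, the iteration count, the small positive floor and the toy count vectors are all hypothetical choices made so that the example needs only the standard library.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upwards by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def fit_dirichlet(counts, iterations=200, init=1.0, floor=1e-8):
    """Fixed point estimate of Dirichlet parameters from per-person count vectors."""
    n_outcomes = len(counts[0])
    phi = [init] * n_outcomes
    totals = [sum(c) for c in counts]
    for _ in range(iterations):
        s = sum(phi)
        denom = sum(digamma(n + s) - digamma(s) for n in totals)
        phi = [
            # The floor keeps a parameter strictly positive if an outcome never occurs.
            max(p * sum(digamma(c[i] + p) - digamma(p) for c in counts) / denom, floor)
            for i, p in enumerate(phi)
        ]
    return phi


def posterior_parameters(phi, user_counts):
    """Conjugate update: Dirichlet parameters after observing one person's counts."""
    return [p + c for p, c in zip(phi, user_counts)]


# Hypothetical population data: six people, three outcomes, ten observations each.
population = [[9, 1, 0], [2, 7, 1], [8, 1, 1], [1, 8, 1], [10, 0, 0], [3, 6, 1]]
phi_hat = fit_dirichlet(population)
psi_params = posterior_parameters(phi_hat, [5, 4, 1])
print(phi_hat, psi_params)
```

The fitted parameters play the role of the estimated prior, and `posterior_parameters` gives the user-specific posterior obtained by conditioning on that user's counts.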



³ These are not necessarily other users.

Figure 1: The evolution of error rates as more data becomes available, when the
user model and prior are estimated. The points indicate means from 10^ runs
and the lines top and bottom 5% percentiles from a bootstrap sample.

3 Experimental evaluation 

We have performed a number of experiments in order to evaluate the proposed 
approach and compared it to the full Bayesian version of the well-known world 
model approach. We performed a set of experiments on synthetic data, and 
another set of experiments on real data. 

For the synthetic experiments, we assume multinomial models, but rather
than knowing $\gamma$, we use data from other users to form an empirical estimate $\hat{\gamma}$,
as described in Sec. 2.3.1. Furthermore, $q$ is itself unknown and is estimated
via Bayesian updating from $\hat{\gamma}$ and some data specific to the user. We then
compare the oracle and the world model approach (based on $\hat{\gamma}$) with a number
of differently biased adversary models. The world model is based on the estimate
$\hat{\gamma}$. The adversary model uses the world model $\hat{\gamma}$ as the adversary prior ($\xi$).

The second group concerns experiments on data gathered from an access
control system. The data has been discretized into 1320 integer variables, in
order for it to be modelled with multinomials. The models are of course not
available, so we must estimate the priors: the data of a subset of users is used
to estimate $\hat{\gamma}$. The remaining users alternately take on the roles of legitimate
users and adversaries.

We compare the following types of models, which correspond to the legends
in the figures of the experimental results: (a) the oracle model, which enjoys
perfect information concerning adversary and user distributions, (b) the world
model, which uses the prior over user models as a surrogate for the adversary
model, (c) the bias world model, which uses all but the last observation to
obtain a posterior over adversary models, and similarly: (d) the f bias world
model, which uses all observations, (e) the p bias world model, which weighs
the observations by 1/2 and (f) the n bias world model, which uses the first
half of the observations. In all cases, we used percentile calculations based
on multiple runs and/or bootstrap replicates [11] to assess the significance of
results.



3.1 Synthetic experiments 

For this evaluation, we ran 10** independent experiments and employed
multinomial models. For each experiment, we first generated the true prior distribution
over user models $\gamma$. This was created by drawing Dirichlet parameters
independently from a Gamma distribution. We also generated the true prior
distribution over adversary models $\gamma'$, by drawing from the same Gamma
distribution. Then, a user model $q$ was drawn from $\gamma$ and an adversary model $w$
was drawn from $\gamma'$. Finally, by flipping a coin, we generated data $x_1, \ldots, x_n$
from either $q$ or $w$. Assuming equal prior probabilities of user and adversary, we
predicted the most probable class and recorded the error. This was done for all
subsequences of the observations' sequence $x$. Thus, the experiment measures
the performance of methods when the amount of data that informs our decision
increases.
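The sampling protocol above can be sketched as follows. Only the oracle is scored here, to keep the example short. The number of outcomes, the sequence length, the number of runs, the Gamma parameters (including the 0.5 offset that keeps the Dirichlet parameters away from zero) and the use of prefixes as the evaluated subsequences are all assumptions made for illustration, not values taken from the experiments.

```python
import math
import random


def draw_dirichlet(rng, params):
    """Sample a probability vector from a Dirichlet via normalised Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def log_lik(model, xs):
    # Floor the probabilities so that a numerically zero entry cannot break log().
    return sum(math.log(max(model[x], 1e-300)) for x in xs)


def one_experiment(rng, n_outcomes=5, n_obs=20):
    """Return, for each prefix length t = 1..n_obs, whether the oracle erred."""
    # Assumed hyper-prior: Dirichlet parameters are 0.5 + Gamma(1, 1) draws.
    gamma_user = [0.5 + rng.gammavariate(1.0, 1.0) for _ in range(n_outcomes)]
    gamma_adv = [0.5 + rng.gammavariate(1.0, 1.0) for _ in range(n_outcomes)]
    q = draw_dirichlet(rng, gamma_user)
    w = draw_dirichlet(rng, gamma_adv)
    from_user = rng.random() < 0.5  # the coin flip deciding who generates the data
    source = q if from_user else w
    xs = rng.choices(range(n_outcomes), weights=source, k=n_obs)
    errors = []
    for t in range(1, n_obs + 1):
        # With equal class priors, the most probable class has the larger likelihood.
        says_user = log_lik(q, xs[:t]) >= log_lik(w, xs[:t])
        errors.append(says_user != from_user)
    return errors


def error_curve(n_runs=500, seed=0):
    """Average oracle error rate per prefix length over independent experiments."""
    rng = random.Random(seed)
    runs = [one_experiment(rng) for _ in range(n_runs)]
    return [sum(col) / n_runs for col in zip(*runs)]


curve = error_curve()
print(curve[0], curve[-1])
```

Replacing the oracle comparison with one of the other decision rules would give the corresponding curve for that method under the same sampling scheme.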

For these experiments, we estimate the actual Dirichlet distribution with $\hat{\gamma}$.
This estimation is performed via empirical Bayes using data from 1000 users
drawn from the actual prior $\gamma$. At the $k$-th run, we draw a user model $q_k \sim \gamma$
and subsequently draw $x_k \sim q_k$. We then use $\hat{\gamma}$ and the user data $x_k$
to estimate a posterior over user models for the $k$-th user, $\psi_k(q) = \hat{\gamma}(q \mid x_k)$.
The estimated prior $\hat{\gamma}$ is also used as the world model and as the prior over
adversary models. The results, shown in Figure 1, show that the biased models
consistently outperform the classic world model approach, while the partially
biased models become significantly better than the fully biased models when
the number of observations increases. This is encouraging for application to
real-world data.



3.2 Real data 

The real world data were collected from an RFID based access control system 
used in two buildings of the TNO organization (Netherlands Organization for 
Applied Scientific Research) . The data were collected during a three and a half 
month period, and they include successful accesses of 882 users, collected from 
55 RFID readers granting access to users attempting to pass through doors in 
the buildings. 



The initial data included three fields: the time and date that the access
has been granted, the reader that has been used to get access and the ID of
the RFID tag used.⁴ In order to use the data in the experimental evaluation
of the proposed model framework, we have discretized the time into hour-long
intervals, and counted the number of accesses, per hour, per door for each user,
in each day. This resulted in a total of ≈ 2 · 10^ records. Since there are 24
hour-long slots in a day, and a total of 55 reader-equipped doors, this discretisation
allowed us to model each user by a 1320-degree multinomial/Dirichlet model.
Thus, even though the underlying Dirichlet/multinomial model framework is
simple, the very high dimensionality of the observations makes the estimation
and decision problem particularly taxing.
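The discretisation can be sketched as below. The tuple layout of an access event, the zero-based door numbering and the ordering of the 1320 outcomes (door-major, then hour) are assumptions made for illustration; the text above only fixes the 24 x 55 = 1320 outcome count and the per-user, per-day grouping.

```python
from collections import Counter
from datetime import datetime

N_DOORS = 55
HOURS_PER_DAY = 24
N_OUTCOMES = N_DOORS * HOURS_PER_DAY  # 1320 (door, hour) combinations


def outcome_index(door, hour):
    """Map a (door, hour-of-day) pair to a single multinomial outcome index."""
    return door * HOURS_PER_DAY + hour


def daily_records(accesses):
    """Group access events into one sparse count vector per (user, day).

    accesses: iterable of (user_id, door, timestamp) tuples (assumed layout).
    Returns a dict mapping (user_id, date) to a Counter over outcome indices.
    """
    records = {}
    for user_id, door, timestamp in accesses:
        key = (user_id, timestamp.date())
        records.setdefault(key, Counter())[outcome_index(door, timestamp.hour)] += 1
    return records


# Hypothetical events for one user over two days.
events = [
    ("u1", 3, datetime(2008, 1, 7, 8, 55)),
    ("u1", 3, datetime(2008, 1, 7, 8, 59)),
    ("u1", 12, datetime(2008, 1, 7, 12, 30)),
    ("u1", 3, datetime(2008, 1, 8, 9, 2)),
]
records = daily_records(events)
print(records)
```

A sparse `Counter` is used because each daily record touches only a handful of the 1320 outcomes.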

3.2.1 Experiments 

We performed 10 independent runs. For the $k$-th run, we selected a random
subset $U_\gamma$ of the complete set of users $U$, such that $|U_\gamma|/|U| = 2/3$. We used
$U_\gamma$ to estimate the world model $\hat{\gamma}$. The remaining users $U_t = U \setminus U_\gamma$ were used
to estimate the error rate over 10'^ repetitions. For the $j$-th repetition, we
randomly selected a user $i \in U_t$ with at least 10 records $D_i$. We used half of
those records, $\tilde{D}_i$, to obtain $\psi_i(\mu) = \hat{\gamma}(\mu \mid \tilde{D}_i)$. By flipping a coin, we obtain
either (a) one record from $D_i \setminus \tilde{D}_i$, or (b) data from some other user in $U_t$.
Let us call that data $x_j$. For the biased models, we set $\xi = \hat{\gamma}$ and then used
$x_j$ to obtain $\xi(w \mid f(x_j))$, where $f(\cdot)$ denotes the appropriate transformation.
Figure 2 shows results for the baseline world model approach (world), where
$f(x) = 0$, as the unmodified world model is used for the adversary, the full
bias approach (f bias), where $f(x) = x$ since all the data is used, and finally
the partial bias approach (p bias) where $f(x) = x/2$. The other approaches
are not examined, as the oracle is not realisable, while the half-data and the
all-but-last-data biased models are equivalent to the baseline world model, since
we do not have a sequence of observations, but only a single record.
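For Dirichlet priors, the three transformations reduce to adding a weighted copy of the record's counts to the prior parameters before evaluating the adversary marginal. The sketch below assumes this conjugate form, with a single `weight` argument standing in for $f$ (0 for world, 1/2 for p bias, 1 for f bias); the parameter vectors and the record are hypothetical.

```python
import math


def log_polya(counts, phi):
    """Log marginal likelihood of a count vector under a Dirichlet(phi) prior.

    The multinomial coefficient is omitted: it is identical for the user and
    adversary terms and cancels in the posterior ratio.
    """
    n = sum(counts)
    s = sum(phi)
    out = math.lgamma(s) - math.lgamma(n + s)
    for c, p in zip(counts, phi):
        out += math.lgamma(c + p) - math.lgamma(p)
    return out


def user_posterior(counts, psi_user, gamma_hat, weight, prior_user=0.5):
    """Posterior probability of the user for one record.

    weight plays the role of f: 0.0 (world), 0.5 (p bias) or 1.0 (f bias).
    """
    # Condition the adversary prior on the transformed record f(x) = weight * x.
    xi_prime = [g + weight * c for g, c in zip(gamma_hat, counts)]
    log_user = math.log(prior_user) + log_polya(counts, psi_user)
    log_adv = math.log(1.0 - prior_user) + log_polya(counts, xi_prime)
    shift = max(log_user, log_adv)
    user = math.exp(log_user - shift)
    adv = math.exp(log_adv - shift)
    return user / (user + adv)


# Hypothetical estimated prior, user posterior parameters and test record.
gamma_hat = [2.0, 1.0, 1.0, 0.5]
psi_user = [g + c for g, c in zip(gamma_hat, [4, 1, 1, 0])]
record = [0, 1, 0, 3]

world = user_posterior(record, psi_user, gamma_hat, weight=0.0)
p_bias = user_posterior(record, psi_user, gamma_hat, weight=0.5)
f_bias = user_posterior(record, psi_user, gamma_hat, weight=1.0)
print(world, p_bias, f_bias)
```

Increasing the weight makes the adversary model fit the record more closely, so the probability assigned to the user decreases from world to p bias to f bias.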

As can be seen in Figure 2, the baseline world model always performs
worse than the biased models, though in two runs the full bias model is close.
Finally, though the two biased models are not distinguishable performance-wise,
we noted a difference in the ratio of false positives to false negatives. Over the
10 runs, this was 0.2 ± 0.1 for the world model approach, 2.5 ± 0.5 for the fully
biased model, and 0.9 ± 0.2 for the partially biased model.

4 Conclusion 

We have presented a very simple, yet effective approach for classification
problems where one class has no data. In particular, we define a prior over models
which can be estimated from population data. This is adapted, as in the
standard world-model approach, to a specific user. We introduce the idea of creating
an adversary model, for which no labelled data exists, from the prior and
currently seen data. Within the subjective Bayesian framework, this allows us to
obtain a subjective upper bound on the probability of an attack.

⁴ The data were sanitised to avoid privacy issues.

Figure 2: Error rates for 10 runs on the TNO door data. The error bars indicate
top and bottom 5% percentiles from 100 bootstrap samples from 10'^ repetitions
per run.

Experimentally, it is shown that: (a) we outperform the classical world 
model approach, while (b) it is always better to only partially condition the 
models on the new observations. 

It is possible to extend the approach to the cost-sensitive case. Since we 
already have bounds on the probability of each class, together with a given cost 
matrix, we can also calculate bounds on the expected cost. This will allow us 
to make cost-sensitive decisions. 

A related issue is whether to alter the a priori class probabilities; in our
comparative experiments we used equal fixed values of 0.5. It is possible to
utilise the population data to tune them in order to achieve some desired false
positive / negative ratio. Such an automatic procedure would be useful for
an expected performance curve [1] comparison between the various approaches.
Finally, since the experiments on this relatively complex problem gave promising
results, we plan to evaluate the method on other problems that exhibit a lack of
adversarial data.

⁵ In an accompanying technical report the effect of dimensionality on the performance
of the method is also examined. There, it is shown that a Bayesian framework is essential
for such a scheme to work and that naive approaches perform progressively worse as the
dimensionality increases.



A Proof 

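Lemma 2.1. For discrete $\mathcal{M}$, the marginal prior $\xi(x)$ can be re-written as follows:

$$\xi(x) = \sum_{\mu \in \mathcal{M}} \xi(x \mid \mu)\,\xi(\mu) = \sum_{\mu \in \mathcal{M}} \mu(x)\,\xi(\mu), \tag{5}$$

and similarly: $\xi'(x) = \sum_{\mu \in \mathcal{M}} \mu(x)\,\xi(\mu \mid x) = \sum_{\mu \in \mathcal{M}} \mu(x)^2\,\xi(\mu) / \xi(x)$. Thus, to prove the required
statement, it is sufficient to show

$$\left(\sum_{\mu \in \mathcal{M}} \mu(x)^2\,\xi(\mu)\right)^{1/2} \geq \sum_{\mu \in \mathcal{M}} \mu(x)\,\xi(\mu). \tag{6}$$

Similarly, for continuous $\mathcal{M}$, we obtain:

$$\left(\int_{\mathcal{M}} \mu(x)^2\,d\xi(\mu)\right)^{1/2} \geq \int_{\mathcal{M}} \mu(x)\,d\xi(\mu). \tag{7}$$

In both cases, the norm induced by the probability measure $\xi$ on $\mathcal{M}$ is
$\|f\|_2 = \left(\int_{\mathcal{M}} |f(\mu)|^2\,d\xi(\mu)\right)^{1/2}$, thus allowing us to apply the Cauchy-Schwarz
inequality $\|fg\|_1 \leq \|f\|_2 \|g\|_2$. By setting $f(\mu) = \mu(x)$ and $g(\mu) = 1$, we
obtain the required result, since $\|g\|_2 = \left(\int_{\mathcal{M}} d\xi(\mu)\right)^{1/2} = 1$, as $\xi$ is a probability
measure. □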